So welcome to the dark arts of OSINT.
This would be Dr. Noah Schiffman, a.k.a. Security Freak.
He is the academic of our team, the one who actually finished college.
Yes, he is way more intelligent than I am, a snappy dresser, and an absolutely wonderful guy.
Can I have a red shirt and kick the shit out of this guy over here?
Evan Davison.
Sorry.
Phone is off.
All right, there we go. Not like they didn't tell us that earlier.
And I'm Skydog, of course, by the picture there.
We are part of the Dead Bunny Club, whether you've heard of that or not.
It is the pseudo-philanthropic arm of everything Skydog does.
So we got together. I met you a couple years ago, and we found that we're fast friends,
and we have a lot of fun getting together and getting into major trouble.
Sometimes a little more than friends.
Do what?
Sometimes a little more than friends.
We weren't going to talk about that.
Sorry.
I took that out of the presentation.
All right.
Okay, sorry, sorry.
He's a great cuddler, though.
Yes.
So, okay.
As they announced earlier, this is my 11th year coming to DEF CON.
I was actually here back in the AP days.
Who's been to the AP days?
A round of applause.
Everyone's a newbie.
That's wonderful.
I just heard about DEF CON like two weeks ago.
Two weeks ago?
Yeah.
That's how that works.
Yeah.
So I get to celebrate, ironically, at my 11th year here.
I've actually been a goon for nine years.
For my 11th year here, I get to celebrate three firsts, which is kind of odd.
It's not losing my virginity.
Don't worry.
It'll happen soon.
I'm hoping.
I'm really holding out.
One day.
I understand I have to talk to a girl, though, and I'm not ready for that.
So the first one, it was really wonderful.
My son got to participate in DEF CON Kids.
I'm old enough now that I have offspring.
Cooper.
He got to celebrate in DEF CON Kids.
He placed fourth in social engineering and second in Hacker Jeopardy.
So definitely a first for me.
My second would be my first Mohawk ever.
I got to participate in Mohawk Con this year.
So a round of applause for those guys.
They did an absolutely wonderful job.
I had to leave Vanderbilt to actually be able to make that one happen.
And, of course, my third is actually being accepted to speak at DEF CON.
Which is a great honor.
I did find out that they do require you to submit a paper.
That's why it took me so long.
I didn't read the fine print.
But here we are.
We're talking about our live demo.
So there's this live demo thing that we maybe kind of discussed in the CFP and brochure.
Well, so I don't know how many people here are familiar with something called MATLAB.
Or R.
Or I don't know.
Other letters of the alphabet.
Yes.
What's your favorite letter?
No.
So I didn't have a licensed copy of MATLAB.
And went with Octave.
And got into a battle with Octave.
And Octave won.
And I lost.
So we're doing a different kind of live demo.
That's sort of audience participation based.
So it's going to be really fun.
And everyone is going to get to meet people.
Sitting next to you.
It's going to be a fun icebreaker.
No, it's not.
But it's going to be a demo that we can all participate in.
And make a point.
So I hate Octave.
Hate it.
Okay.
So that's all.
Yeah.
Ready?
Get loose.
Here we go.
So our talk today is about the dark arts of OSINT.
So the path we're going to take.
We're going to talk about what is OSINT.
We're going to move on to.
Evan, if you call me again.
I'll fucking kill you.
I swear.
Fucking kill you.
Anyway.
I digress.
So we're going to speak about what is OSINT.
We're going to talk about some acquisition tools and techniques.
I'm then going to sit down.
And the guy with the math background.
Is going to speak about anonymizing data.
And then.
That's you.
You don't remember?
I'm going to leave the stage.
And Noah is going to speak about anonymizing.
And de-anonymizing data.
So open source intelligence.
I guess I have to hit the button.
Don't I?
Open source intelligence.
Thank you for putting the pause in there.
Did you get the transitions in there?
Some.
I don't know.
The cool one that wipes?
Dissolve.
So what is open source intelligence?
Essentially open source intelligence.
Is anything out there.
That you can reach.
Without having to be a LEO or something similar.
Or belong to a large organization.
Or require paperwork to get to it.
It's anything you can get to online.
Or readily available.
Well why do you care?
Who had a picture taken of them this weekend.
By some jackass with a camera.
Not one of our photographers.
But someone with a phone or whatever.
Okay.
Guess what.
You're now hooked up with open source.
The information is out there.
You appear in a picture.
Now it's something I can catalog and index.
So congratulations.
Prism.
We weren't going to talk about that.
So how can it be optimized?
We're looking at big data sets.
One of the things that Noah is going to get to.
Is taking the big data sets.
And crunching the numbers.
And actually extracting some information.
Out of what's available.
Readily available.
OSINT comprises many things.
One of them would be text.
Whether it is emails that you sent.
Back in 73.
Where you were talking about something bizarre.
Did you send anything back?
Never mind.
I've gone back actually.
And found some of the things that I've done on forums.
Way, way, way back in the day.
Using a different name.
I was able to actually find online.
Things that probably would have shown how ignorant I was at the time.
But anyway.
You have text that's out there.
That can be searched for.
You also have imagery.
We have Facebook.
We have appearing at DEF CON.
Whether you realize it or not.
You probably had a picture taken of you at some point in time.
That appears there.
Video.
I think last night Evan played the little VR system.
Where you had to move around the map.
And begin to do the robot.
Which was an absolute hoot.
Which will appear on YouTube.
With a little bit of captioning later on.
Yeah the black cat robot.
Absolutely.
So we also have audio.
The video that we have here of this presentation.
Will be available on DVD later.
But they also put the audio up of that.
So if you're not into driving.
While looking at your iPhone.
You can listen to the audio.
And then you have geospatial.
Which would be the images you take.
From a device that's GPS enabled.
It records your longitude and latitude and altitude.
And fun things like that.
Other information that doesn't always.
Get removed from imagery.
When it's put online.
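A quick sketch of why that leftover metadata matters. EXIF stores a GPS position as degrees, minutes, and seconds plus a hemisphere letter, and turning that into a map coordinate is a few lines of arithmetic. The coordinates below are made up for illustration, and dms_to_decimal is a hypothetical helper name. Pulling the tags out of a real image would need an EXIF library, which is left out here.

```python
from fractions import Fraction

def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to decimal degrees."""
    degrees, minutes, seconds = (Fraction(x) for x in dms)
    value = degrees + minutes / 60 + seconds / 3600
    # South and West are negative in decimal notation.
    if ref in ("S", "W"):
        value = -value
    return float(value)

# Hypothetical values of the kind a GPS-enabled phone would embed in a photo.
lat = dms_to_decimal((36, 6, 36), "N")
lon = dms_to_decimal((115, 10, 12), "W")
print(lat, lon)
```

Paste a pair of numbers like that into any mapping site and you have where the picture was taken.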
There is a certain signal to noise ratio.
If you've been online.
And you've looked for data.
A lot of times the aggregators of that data.
May have some really bizarre things.
That show up.
No, I never lived in Henderson, Nevada.
But for some reason.
There's a phone number associated with that.
So there's just a certain amount of it.
That's out there.
That doesn't really fall into place correctly.
You have to go through.
And decrease the noise.
To get the true signal.
So out of that.
Once you clean up enough data.
You're able to go through.
And put enough things together.
Layer them together.
Find where the high points in the graph appear.
You will find actionable data.
Anyone that's actually.
In the law enforcement community.
Which I'm not.
Anyone who is in that community.
Realizes that when enough data is collected.
It becomes actionable.
And then it becomes intelligence.
Something that can be used.
To actually do something.
So.
Sorry I got a little cough there.
The history and origins of.
You leaving?
Furball.
No I don't want to drink anymore.
Not yet.
Wait until you get on stage.
So print media.
Originally had newspaper clippings.
From other parts of the United States.
Someone would catalog those things.
And actually write up a report on it.
We moved into the radio age.
Things were actually transcribed.
And then cataloged and indexed.
The search time on information like that.
Was a little long.
If you want to complain about Oracle.
Or MySQL or something like that.
The paper version of it really sucked.
We moved to television.
Things that got compressed down to.
Video tape.
And things of that nature.
Like I said I recently worked for Vanderbilt.
They have the largest compendium.
Of news broadcasts.
They go back farther than anyone else.
That information can.
Also be searched by metadata.
And then of course we're down to the internet age.
Where every jackass can get on.
Out there and dance.
And then put online their robot.
At a large security conference.
That's coming back to haunt you asshat.
So the evolution began.
News sources, of course.
With radio and print.
And then we moved to government repositories.
For some reason.
They decided it would be a good idea.
To collect information and store it.
Who knew?
Then you went to academic publications.
Where they began to collect data.
And sort everything and put it together.
Theoretically they anonymized it.
And now we've moved into the age.
Of electronic databases.
Where we know everything about you.
Those are sexy.
Those will get you laid.
Definitely.
The current forms and uses of OSINT.
Are definitely tool sets.
Websites you can go to.
And of course databases you can get your hands on.
Depending on what your flavor is.
So.
Let's see.
Maltego.
Show of hands.
Cool.
Maltego is basically used.
You put a click.
Last time I let you do this part.
Maltego is used basically.
Or primarily to dig down on an organization.
You can look at their WHOIS records.
And DNS and IPs and emails.
And things of that nature.
I'm going to have someone else come up here.
And stomp your ass too.
But anyway.
Maltego is really good for drilling down on a company.
By looking at email addresses and things.
To compile a large amount of data.
Who has used FOCA?
Anyone in the house?
If you haven't played with FOCA.
FOCA is a lot of fun.
Basically it looks at the metadata.
In Microsoft Office.
Documents.
PDFs.
It will do OpenOffice.
It actually looks at the EXIF metadata in pictures.
So you can begin to compile information.
Just from the hidden information.
In all the documents.
Randy from accounting.
Puts out some sort of a document.
And inside that.
It contains information about where it is stored.
On the local network.
It actually makes it to the outside world.
And gives me some information.
About how the interior network is built.
That one is a really nice fun one to play with.
SearchDiggity.
Anyone use that one?
Not in my backyard.
Do what?
Nothing.
It isn't used as much as everyone would like.
But it basically is another form of being able to sift through data.
It takes information from Bing and Google.
And other sources.
And sort of compiles it together.
And gives you a nice little interface.
To be able to get to it.
So a lot of different pieces of software out there.
Has anyone heard of Recorded Future?
Okay.
This is one of those that makes you kind of cringe a little bit.
It's a temporal analysis software.
It's an analysis engine.
It forecasts and does analysis.
To predict future events.
Based on information from social networks.
And patterns that they can find.
So they're able to go in.
And put some information in.
And actually determine what could possibly happen.
Based on information that's flowing right now.
So.
And of course there's Facebook.
Who's put their music preferences on?
Alright.
Who uses Facebook?
It's alright. We're among friends.
Big mistake.
Yeah.
Did we get a picture of that?
So if you've put onto Facebook.
Hey I like REO Speedwagon.
And for all the young guys in the crowd.
That's really a rocking band.
Hey I went to REO Speedwagon.
Well.
I can go back in with graph search now.
And say hey I want to know anyone who lives in Tennessee.
Who likes REO Speedwagon.
And blah blah blah.
And I can then mine some data out.
And I guess give you a jingle.
And you'll have a list of records.
At which point you would probably run.
So.
There are a lot of ways.
Things are actually being put out there now.
For you to be able to look at the data.
And try to grind through it.
There are other websites.
Social Mention.
Spokeo.
Meltwater.
I have my own personal preferences on what to use.
Johnny Long isn't here.
But who's ever seen the Google hacking database?
Okay.
Things that people have put together.
If you're looking for certain types of information.
They've put query structures together for you to use.
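To give a feel for what a query structure looks like, here is a minimal sketch that strings together a few well-known search operators (site:, filetype:, intitle:). The build_query helper, the keywords, and example.com are all placeholders for illustration, not entries taken from the Google hacking database.

```python
def build_query(keywords, site=None, filetype=None, intitle=None):
    """Compose a search string from common advanced-search operators."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intitle:
        parts.append(f'intitle:"{intitle}"')
    # Quote multi-word keywords so they are searched as a phrase.
    parts.extend(f'"{k}"' if " " in k else k for k in keywords)
    return " ".join(parts)

# Hypothetical target and keywords.
query = build_query(["org chart", "internal"], site="example.com", filetype="pdf")
print(query)
```

The value of a curated list is that someone else has already worked out which combinations tend to surface interesting documents.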
This is what it's like to hang out with Noah and I.
At any point in time.
So.
Basically.
You have three different types of public data.
You have cooperatively.
I need to drink some more.
Cooperatively provided data.
Which would be.
This is my name.
Social networking.
It's what I put on Facebook.
I like REO Speedwagon and Smurfs.
It's things that you willingly.
You like Smurfs.
Sorry.
I'm worried about you.
It's things that you put out there.
Your personal preferences and things of that nature.
Posts that you've made that can actually be mined to look at.
But you've willingly given it up.
Did I say that right?
Okay.
Just checking.
Things that are confidentially provided.
I had to log in to give that information.
I filled out a questionnaire or survey.
I said.
Yes.
I'm more than happy to allow you to look at this information.
I put enough in there.
That it's very identifiable.
Be it my address.
My phone number.
My credit card.
Things of that nature.
So you have to actually.
It's a site with a privacy policy where you say I agree to it.
So you've given that information up.
And you've agreed to their legal statement there.
And then you have the unknowingly provided.
Or the.
Wait.
Where did they get this from?
So it's the DMV records.
It's other information.
Maybe it's your medical records.
Or how the fuck did they get my APGAR scores.
So you know.
I was slow at birth.
And it never got better.
Things that are third party generated.
Government and academia.
I participated in something in college.
Where you got paid 20 bucks to get an ass probing.
Or something like that.
For research.
So they take that data.
And they put it into a database.
And they put it online.
Theoretically your name is not associated with it.
But.
So.
So who publishes these data sets?
A lot of the time it's government.
There's academia.
There's a commercial market now.
For data that's been pieced together.
And for a certain fee.
You can go in.
And cruise through that data.
The more you pay.
The more granular your data becomes.
And the more revealing it is.
Why are these data sets published?
For statistical analysis.
It's coming up.
Try not to laugh.
For statistical analysis.
We want to go back and look at the information.
And do some predictions.
Looking for trends and patterns that are out there.
And retrospective outcomes.
We struggled trying to find.
The proper example for this.
We decided on.
Which is better.
Viagra or Cialis.
We go back and look at the information.
We see the satisfaction.
I guess that's not the right terminology.
I said Viagra.
I heard someone say Cialis.
A buddy of mine I swear.
He said you. No. No.
It was a friend of mine too.
No it wasn't.
Evan?
Where did Evan go?
He's hiding.
That's good.
And of course this information.
Is used for decision making.
For future things.
Maybe it is product design.
Or coming up with something new.
Whether it's actually going to be.
Popular in any way shape or form.
So.
A lot of the things that are used in here.
On the websites.
I don't do the math.
That's this gentleman's side of things.
Occasionally I get asked.
To find things.
Who in the crowd.
Who finished high school?
Show of hands.
It's okay.
Who went to college?
Now who finished college?
Okay.
This is your crowd.
So anyway.
He's got one degree.
Two degrees.
Do you want to do that?
No.
So.
I did not finish college.
I had a hell of a lot of fun.
While I was there.
Per my GPA.
But what I did not learn.
While I was at college.
Is what you can and can't do.
It was not taught out of me.
When I.
Oh you can't do it that way.
So I've never heard that before.
It makes it a lot easier for me to do some things.
Like drill data on somebody.
So occasionally I'll get a phone call.
And I'll get a couple pieces of criteria.
And they say find someone.
And I've become very adept at doing so.
Using all the open source information that's out there.
So.
Has anyone stayed at the Bellagio?
This is audience participation.
You're awake right?
Anyone been there?
Cabana by the refrigerated pool.
Absolutely wonderful.
If at any point in your lifetime you can make that happen.
Definitely do it.
I'm in the sun.
I've got the MacBook Air with me.
I'm trying to get on the shitty wireless there.
That does not work.
And there's a gentleman to my immediate right.
And he notices I have a computer.
Which for all of us.
That is typically the starting point to.
Yeah dude my computer at home doesn't work.
Who's ever answered that question?
So I'm in a swimsuit.
By the pool.
And the guy starts talking to me.
Okay I'll bite.
No problem.
So we start discussing China.
Politics.
The economy.
Fun things like that really make you happy.
We have a few drinks.
And he says so you're in Vegas.
Are you here for business or pleasure?
And I said well currently for pleasure.
I would think that would be the case.
If I'm by the pool.
And he says so you're here for pleasure.
That's good.
I'm coming back out to the largest.
Hacker conference in the United States.
Called DEF CON.
And you could hear his asshole pucker in the seat.
So that's one of those things.
Where who in the crowd hasn't.
Had to explain what that means.
Put your hand down asshat.
So I began to explain what DEF CON is.
Since we didn't have the documentary.
It was very interesting.
Trying to explain it to him.
The hearing impaired con.
Definitely.
But I got to spend some time trying to explain to him.
A mundane, what we actually do.
And why we get together for all this.
And then his jackass friend shows up.
Who has come to Vegas.
To go to the Pawn Stars.
Place downtown.
And he comes back.
Dude I got to meet Hoss.
And I'm thinking okay.
Let's go get a steak.
So he packs everything up.
And he says.
Yeah we're going to head off and get a steak.
At so and so place.
And really nice meeting you.
Later.
And I said just a second.
I said your name is Brian.
And your family owns a construction.
Civil construction firm in Seattle Washington.
And the guy says.
Yeah.
And I said I'll send you an email to your work email.
Within the next 48 hours.
Asshole pucker.
And I said don't worry.
I said I'm going to show you.
I have two bits of information on you.
I don't have your last name.
I don't have much more than that.
But I'm going to send you an email.
And show you what's possible.
So we went out and had a nice dinner.
Went out to the pool the next day.
And at some point I thought.
I got to go find Brian.
So I sit down on the bed.
Fire up the laptop.
And in 45 minutes.
I have where he lives.
Pictures of his house.
What he paid for it.
Pictures of all of his relatives.
I then took it upon myself.
To scan the exterior of his network.
And tell his system administrator.
You probably should change this.
It's not good to have this open.
Brian never responded to the email.
Oddly enough.
I didn't think it was a problem.
I didn't send him an invoice.
I did it gratis.
But that's a good example of.
I had two bits of information on the guy.
Fortunately.
One of them was unique enough.
It allowed me to find him.
I was able to correlate.
Civil construction.
Oddly enough.
Against the YouTube video.
Which I was able to pick this guy out in.
And from there just went to town on him.
So I guess if you get an email from a guy.
That you met by the pool.
Who says he's a hacker.
And he has a picture of your house.
From the driveway.
It might be a little bit unnerving.
Was that legal?
I don't give a shit.
So anyway.
I don't have to have a court order.
Apparently no one else does.
So.
But anywho.
The open source side of it.
Can be a lot of fun.
One of the things that Noah is going to discuss.
Is finding outliers in the data.
Brian gave me enough to be able to find him.
Had he said my name is John.
The problem would have been.
A little bit more difficult.
If he said yeah I work at Starbucks.
Okay.
Not as much of an outlier there.
But given time and effort.
And how much he pissed me off.
I probably would have found him eventually.
But based on the information.
It took me about 45 minutes to track him down.
So if you ever get bored.
And you're by the pool at the Bellagio.
Just wait for someone to come by.
You like talking.
Talking to guys at pools don't you?
Yeah.
Have you ever been given a wedgie on stage?
I would love that.
Okay.
You take the little microphone.
Okay.
Wow.
Sky claimed that I'm going to talk about.
A lot of things.
That I don't know where.
He got that from.
You're drunk.
You're really really drunk.
I know a little bit of math.
Some basic addition.
Subtraction stuff.
I'm not really going to talk about anything.
Really hard and advanced.
That's for smart people.
Data.
Actually a lot of these slides.
Hello.
There's echo.
I don't like that echo.
Sorry.
Have fun.
Damn it.
Okay.
So these slides are semi-new to me.
But I think I did make them.
Let's go through them.
Data science.
This is a big field.
Data science.
The science of data.
Science has been around for a long time.
Data has been around for a long time.
You put them together.
Okay.
So it's emerged mostly over the past decade.
To being a real field, with the real data scientist, the information scientist.
That's been a past decade kind of thing.
And it sort of came out of the whole business analytics, competitive intelligence.
Like everything else.
Driven by big business.
Because they're just looking out for our best interest.
And so all of a sudden people who are like statisticians.
Who are at the top.
Experts at data mining.
And all these types of advanced mathematical analyses.
Are very valuable to big businesses.
And other entities that like to analyze large data sets.
Are there other entities that collect lots of data?
None that I've heard of.
I haven't heard of any either.
But I'm sure there are organizations out there.
That are collecting lots of data.
And doing something with this.
Purely for benevolent reasons.
What's that?
For benevolent reasons.
Yeah, exactly.
But it's mostly to enhance our shopping experience.
Right?
Like other people who bought this.
Also bought this.
Statistics.
You're given data.
You try to come up with a model.
Probability.
Given a model.
Let's try to predict the data.
Simple concept.
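A minimal sketch of those two directions, using made-up purchase data. The statistics half estimates a model (a single purchase probability) from observations. The probability half takes that model and asks how likely a particular future outcome is. The observations and the choice of a binomial model are assumptions for illustration only.

```python
from math import comb

# Statistics: given data, come up with a model.
# Hypothetical observations: 1 = customer bought the item, 0 = did not.
observed = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
p_hat = sum(observed) / len(observed)   # estimated purchase probability

# Probability: given the model, predict the data.
def prob_exactly(k, n, p):
    """Binomial probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Chance that exactly 3 of the next 5 customers buy, under the fitted model.
p_three_of_five = prob_exactly(3, 5, p_hat)
print(p_hat, round(p_three_of_five, 4))
```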
Okay.
Here's a little graphic.
Demonstrating what I just said.
And it's useless.
Okay.
Historic data model future.
Ignore.
Data sources.
Okay.
These are some random examples of readily available public data sets.
And we've actually gone from like having databases of information.
To databases that are cataloging the databases of information.
And it's increasing exponentially.
And my favorite was Wolfram.
Freebase.
I came across when I was searching for something else.
But apparently it's a database.
So.
I also like Infochimps, too.
I don't know why.
It's just a funny name.
Okay.
Big data.
Not just data, but big data.
Buzzword.
Who thinks it's a buzzword?
Not that.
Oh, my God.
Some people do, and the other people think it's really like a legitimate real thing.
Okay.
That's cool.
I don't judge.
Well.
I don't know.
I mean, it's hard to define what that really means.
Big data.
Like, you know.
Is it big data?
Is it in the cloud?
Is it a large typeface?
What's the cutoff for being big?
You know.
When does it become really big?
Sky, how big is your data?
My data is huge.
Okay.
I work with a very small data set.
And I'm okay with that.
But.
And at this point, this is yet another presentation we cannot put in our portfolio for public speaking.
Oh, boy.
That's true.
So, technically, at least what I've found is that it's sort of defined as big data.
These incredibly large amounts of data that are being rapidly generated and have lots of variability.
Okay.
You know.
Sure.
But it's still big data.
But the interesting thing about it from our perspective is that the creation of big data has also sort of brought forth the development of tools to work with big data.
To analyze these big data sets.
Visual representation.
Doing number crunching on them.
So all these new mathematical and advanced platforms for performing all kinds of functions on big data.
Which is of interest to us.
And we're going to look at that in a few minutes.
Or not.
Okay.
Terminology.
That means sort of the defining of words.
Kind of.
Okay.
We Googled it backstage.
A lot of Googling.
All right.
So depending who you talk to or what publication you read or, you know, what book.
Anonymization.
De-identification.
Basically mean the same thing.
De-anonymization.
Re-identification.
Basically mean the same thing.
Kind of.
And there will be, again, some studies, some groups that will distinguish between them. But for the purposes of our talk.
It's, yeah.
They're synonymous.
Antonyms.
Opposite meaning.
Yeah.
So.
The.
So you reverse one of these processes.
You get to the other.
Pretty simple.
Anyone with a fifth grade background should get that.
Okay.
Sweet.
On.
Moving on.
Okay.
See, this is real simple stuff.
Okay.
Data.
When it's initially collected, a lot of times it contains personally identifiable information like social security number or address or something else.
Your name.
That would be personally identifiable.
So there needs to be some kind of process that takes this data and makes it sort of anonymous.
I love you, too.
What was that, 10?
No, it's 10.
Holy crap.
Okay.
Dude, you took up all the damn time.
Damn.
Wow.
Okay.
So we need to find a way to make this personally identifiable information.
Why?
Okay.
Make it into anonymous public data.
So there's a couple different ways that can be done in general.
Just removing variables altogether.
A variable that actually is unique enough to be identifying by itself.
Like, you know, I've had eight kids and been in porn.
That's, you know, Octomom or something.
Whatever.
Just remove those.
Global recoding.
Local suppression. Again, recoding certain variables or suppressing certain values in different columns that are really identifiable.
A whole bunch of different ways.
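Here is a rough sketch of those three approaches applied to a toy table: removing variables, global recoding, and local suppression. The records, the field names, and the cutoffs (ten-year age bands, three-digit zip prefixes, suppressing five or more kids) are all invented for illustration. Real anonymization tools choose those parameters far more carefully.

```python
# Hypothetical raw records containing personally identifiable information.
records = [
    {"name": "Alice", "ssn": "000-00-0001", "age": 34, "zip": "37203", "kids": 2},
    {"name": "Bob",   "ssn": "000-00-0002", "age": 37, "zip": "37205", "kids": 1},
    {"name": "Carol", "ssn": "000-00-0003", "age": 52, "zip": "37212", "kids": 8},
]

def anonymize(rec):
    out = dict(rec)
    # Removing variables: drop fields that identify a person on their own.
    del out["name"], out["ssn"]
    # Global recoding: replace every exact age with a ten-year band.
    low = out["age"] // 10 * 10
    out["age"] = f"{low}-{low + 9}"
    # Global recoding again: keep only the first three digits of the zip code.
    out["zip"] = out["zip"][:3] + "**"
    # Local suppression: blank a single value that is rare enough to stand out.
    if out["kids"] >= 5:
        out["kids"] = None
    return out

public = [anonymize(r) for r in records]
for row in public:
    print(row)
```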
Okay.
Anonymization metrics.
We have to figure out a way to look at the way we anonymize data and figure out, hey, is this working?
Is this, like, actually making the data anonymous?
And at the same time, making it usable.
So the whole utility versus actual anonymity.
I mean, that's a balance right there.
So two metrics.
Disclosure risk.
Likelihood of revealing data in the public set.
And then information retention.
How, you know, how does this work?
The utility of that data.
So we take away all this information.
Ah, it's anonymous.
But is it still usable?
So that's the balance you have to strike.
Yeah.
It's a tough problem.
You want to minimize disclosure risk, maximize information retention.
Easier said than done.
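One crude way to put numbers on that balance, assuming a toy data set. Disclosure risk is approximated here as the share of rows whose combination of values is unique, and information retention as the number of distinct combinations that survive recoding. Both proxies and the sample rows are assumptions made for the sketch, not the formal metrics used in the literature.

```python
from collections import Counter

# Hypothetical (age, zip) pairs for six people.
raw = [(34, "37203"), (37, "37205"), (52, "37212"),
       (31, "37203"), (38, "37209"), (55, "37215")]

def recode(row):
    """Coarsen a row: ten-year age band and three-digit zip prefix."""
    age, zip_code = row
    return (age // 10 * 10, zip_code[:3])

def disclosure_risk(rows):
    """Share of rows whose combination of values appears exactly once."""
    counts = Counter(rows)
    return sum(1 for r in rows if counts[r] == 1) / len(rows)

def retention(rows):
    """Crude utility proxy: how many distinct combinations survive."""
    return len(set(rows))

coarse = [recode(r) for r in raw]
risk_before, risk_after = disclosure_risk(raw), disclosure_risk(coarse)
kept_before, kept_after = retention(raw), retention(coarse)
print(risk_before, risk_after, kept_before, kept_after)
```

In this toy case the recoding wipes out the risk, but it also collapses six distinct rows into two, which is the utility you gave up to get there.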
But information entropy.
Anyone familiar with this?
Entropy?
Yes.
Yes.
And not the entropy from thermodynamics, which I spent a long semester trying to go through.
So, yeah.
Information theory.
So the idea is the, oh, my God.
Ten minutes.
I have, like, a million slides to go through.
So I should just start.
Basically, it's the amount of information, the number of bits, that it takes to cover the total number of possibilities for a given piece of data.
Like the, I actually use an eight-sided die in an example that obviously you can roll and you get, like, one through eight.
Because it's got eight sides.
So, yeah.
And the information entropy is going to be three bits.
And, yeah.
So population of the world, let's just say 8 billion.
That's like 33 bits.
Awesome website.
33bits.org.
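The arithmetic behind those two numbers is just a base-2 logarithm, assuming every outcome is equally likely. A short sketch, where bits_needed is a hypothetical helper name:

```python
from math import ceil, log2

def bits_needed(possibilities):
    """Bits required to single out one item among equally likely possibilities."""
    return log2(possibilities)

die_bits = bits_needed(8)                  # eight-sided die
world_bits = bits_needed(8_000_000_000)    # rough world population from the talk

print(die_bits, round(world_bits, 2), ceil(world_bits))
```

Put the other way around, about 33 yes-or-no facts that each split the population evenly would be enough to pick out one person on the planet.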
Very good.
Anyway.
Audience participation.
Everyone just get up and participate in some way real quick.
Because we've got to do something.
No, I don't know.
No, no.
Should we do this?
Do we have time for this or what?
I think we have all the time we want.
Really?
You got that poll?
I didn't do that.
I don't know.
I'm wrong.
Let me get the radio and get a couple of red shirts in there.
Yeah.
I'm going to go through and sort people out based on some criteria.
We can skip it if you want.
Or if you want to stand up and raise your hand.
Do you want to do that?
Okay.
All right.
First question.
Everyone here who this is their first time attending DEF CON, please stand up.
Noob.
All right.
Noob, noob, noob, noob.
Now come up with a West Coast, East Coast or age cut off.
Okay.
Anyone from the East Coast?
Stay standing.
Everyone else, sit down.
You guys paid the highest airfares.
Thank you very much.
We enjoyed that.
Yours.
Anyone here from New Jersey?
Up.
Wait.
I didn't say what to do.
I just said anyone from New Jersey up?
Simon says.
No, no.
Okay.
You can sit down.
Okay.
What do we got?
Like 7, 8, 10 people?
What are the states below New Jersey?
No, no, no.
I was going to say you had a hangover.
But I guess it's not publicly available data unless we query everyone in the room.
Yeah.
I would say anyone who is male, stay standing.
But that's pretty much everyone.
Any female?
Raise your hand.
Shitty data set.
Never mind.
No.
Actually, that would be the unit.
There we go.
Okay.
I'll tell you what.
Anyone say 29 years of age or younger, remain standing.
All the old folks in the room, sit down.
That's good.
And what have we got left?
One, two, three, four.
Yours.
Oh, man.
Anyone here from sort of living below New Jersey?
North Carolina, South Carolina border.
Sit down.
Did we do New Jersey and up?
Yeah.
So we're now between like North Carolina, Jersey.
So.
Who do we have?
You said New Jersey and up.
Still stay standing.
So you're in the upper quadrant there.
So I did age.
We can't do male, female.
Who got laid last night?
Okay.
That's a bad data set, too.
So how many people are we up to?
Who is remaining standing?
Count them off.
I can't see for the lights.
How many people do you think are in this room right now?
Seven, 800, 1,000.
Something like that.
I don't know.
Of that, we're down to what?
Four people?
Three people who remain standing?
And how many questions?
Well, that was.
Five questions?
Well, it was.
So it was maybe, what, four or five questions.
But the entropy for those questions.
So what?
North, west coast, east coast.
Entropy there is one bit.
We had.
What was the other question?
I do.
First time at DEF CON.
First time at DEF CON.
Information entropy there is.
Two bits.
Two bits.
Anyone above, what, New Jersey and above?
Is that what you said?
Yeah, pretty much.
Actually, I think all the questions were like two bit.
Yeah, entropy questions.
So five.
Yeah.
So basically five bits of entropy.
And we were able to narrow down the population to what?
Three people.
Three, four people.
And it's all innocuous information.
But the point is that the combination of all this
innocuous information can actually be quite identifiable.
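A back-of-the-envelope check on the demo. The information gained by each question depends on how much it shrinks the group: going from N people to M people is worth log2(N / M) bits. The head counts below are assumptions, loosely following the numbers guessed on stage (about 1,000 in the room, about 4 left standing). The intermediate counts are invented.

```python
from math import log2

def bits_learned(before, after):
    """Information gained when a group of `before` people shrinks to `after`."""
    return log2(before / after)

# Assumed head counts after each question; only the first and last were estimated on stage.
steps = [
    ("everyone in the room", 1000),
    ("first-time attendees", 400),
    ("from the East Coast", 120),
    ("New Jersey and up", 30),
    ("29 or younger", 4),
]

per_step = [
    (label, bits_learned(prev_count, count))
    for (_, prev_count), (label, count) in zip(steps, steps[1:])
]
total_bits = bits_learned(steps[0][1], steps[-1][1])

for label, bits in per_step:
    print(f"{label}: {bits:.2f} bits")
print(f"total: {total_bits:.2f} bits")
```

Under those assumptions the handful of questions adds up to roughly 8 bits, and the per-question value varies with how lopsided each split is.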
Yeah.
So a round of applause for yourselves.
Thank you.
Okay.
So how much time left?
Just keep going.
Three?
Okay.
I have 20 slides to do in three minutes.
Okay.
Thank you, Scott.
I appreciate that.
Outliers.
Value traits.
Anything outside of the normal distribution.
If you have them in combinations or sets which are unique.
A little bit trickier to detect.
But mathematically possible.
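For a single variable, a simple way to flag an outlier is to measure how many standard deviations a value sits from the mean. This is a minimal sketch with made-up ages and an arbitrary threshold of 2. It is one common approach, not necessarily the one used for the slides, and it does not cover the harder case of unique combinations.

```python
from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Return values more than `threshold` sample standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical ages in a mostly young crowd.
ages = [24, 27, 29, 31, 26, 28, 30, 25, 27, 71]
outliers = find_outliers(ages)
print(outliers)
```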
Graphical example of an outlier.
This is probably the IQ of everyone here in the audience.
And it was an outlier.
Kind of.
I'm special.
Data set intersections.
Venn diagrams.
Who's heard of them?
Yeah.
Okay.
You have sets of data.
You have set A, set B.
What's the intersection there?
A.
Look at that.
A and B.
Amazing.
Now you add C.
Look what you have.
A and C.
B and C.
And what's in the middle?
Holy crap.
Isn't that amazing?
That's a math joke, isn't it?
That's the math thing happening.
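The Venn diagram translates directly into set operations. In this sketch the three sets and their labels are invented. The point is only that each extra data set shrinks the overlap in the middle.

```python
# Hypothetical data sets, each a collection of people sharing one trait.
a = {"alice", "bob", "carol", "dave"}      # e.g. likes a certain band
b = {"bob", "carol", "erin"}               # e.g. lives in a certain state
c = {"carol", "dave", "erin", "frank"}     # e.g. attended a certain event

a_and_b = a & b
a_and_c = a & c
b_and_c = b & c
middle = a & b & c    # the region where all three circles overlap

print(sorted(a_and_b), sorted(a_and_c), sorted(b_and_c), sorted(middle))
```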
Okay.
Unique variable overlap.
You know what?
If you have outliers for different types of data.
You know what?
Just move on.
Mathematical attacks with three minutes.
Yeah.
That's not.
Slow down.
Just do it.
I got it covered.
Sweet.
Inferential analysis.
An example of it.
Remember the targeted advertising?
The teenage woman who was pregnant and was getting all this targeted advertising, based on her purchasing behavior, sent to her household.
And then her dad was upset that she was getting targeted ads for like Enfamil and diapers.
This is my teenage daughter.
She's not pregnant.
And got all pissed off at the manager.
Anyway, she was pregnant.
And that's how he found out.
That's not a good way to tell your parents you're pregnant.
I'm sorry.
Yeah.
That's not how I'll tell my parents.
No.
So database linkage.
Classic example is the whole Netflix IMDB thing.
That was.
Yeah.
I'm sure you all remember that.
Okay.
U.S. Census data.
This happens.
They don't knock on your door anymore.
I think there's like.
Do they?
I don't think they.
When did they stop knocking on your door?
I don't answer the door.
Me and my 12 roommates.
Do they really?
Man.
All right.
Another reason not to answer the door.
Actually.
So a researcher, Latanya Sweeney, worked with the 1990 census data.
She showed that just using information from the census data,
which was date of birth, gender, zip code,
87% of the population was unique.
Amazing.
Based on principles of information entropy.
Amazing.
Exposed healthcare records of the governor of Massachusetts at the time.
Which is kind of funny.
Screw you, Weld.
And applied entropy.
So how she did it.
I mean, zip code.
There's 43,000 zip codes in the U.S. roughly.
Birth dates, 365.
Birth year, about 70 different.
An age range of 70.
And two different genders.
Hermaphrodites were excluded.
So about 30 bits of entropy.
Which covers the entire population of the U.S.
Simple as that.
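The back-of-the-envelope numbers can be checked with a few lines of Python. The attribute counts are the rough ones quoted in the talk; the population figure of 310 million is an assumption added here, and treating every value as equally likely gives an upper bound rather than a measured entropy.

```python
import math

# Rough counts quoted in the talk.
zip_codes = 43_000
days_of_year = 365
birth_years = 70
genders = 2

# Assumed U.S. population, for comparison only.
us_population = 310_000_000

# Upper bound: bits available if every value were equally likely.
bits_available = sum(
    math.log2(n) for n in (zip_codes, days_of_year, birth_years, genders)
)
# Bits needed to single out one person in the population.
bits_needed = math.log2(us_population)

print(f"bits available from the three attributes: {bits_available:.1f}")
print(f"bits needed to pick out one person:       {bits_needed:.1f}")
```

Under these assumptions the three attributes carry a few more bits than are needed to single out one person, which is consistent with most people being unique on them.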
PGP.
Ever heard of PGP?
Personal genome project.
Okay.
So this is another program going on where people voluntarily submit all this genetic information about themselves.
They want to correlate genotype, phenotype to learn about themselves.
Oh, dude.
Look.
Anyway.
Again.
Projects have gone bad.
Yeah.
No one saw that.
That didn't happen.
Pass.
Yeah.
Record linkage.
Is this a cool diagram?
Take care of him, dude.
He's stressing me out.
Record linkage.
So this is where you have a public data set and a private data set.
Public data set maybe has metadata that's publicly available and might have some innocuous but identifying information about an individual.
The private data set, well, that's got personally identifiable information that you don't want people to know.
With record linkage, it's possible to actually correlate the two and discover these anonymous, or so-called anonymous, traits.
So I'm going to show you some traits about a person by combining the two data sets.
And I'll get to them mathematically, how to do that in a second.
Or not if I get kicked off stage.
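A minimal sketch of that kind of linkage in Python, assuming the two data sets share zip code, date of birth, and sex as quasi-identifiers. All names, records, and field names below are invented for illustration; this is one simple exact-match approach, not necessarily the method used for the talk.

```python
# Hypothetical public data set: names plus innocuous attributes.
public = [
    {"name": "Alice Example", "zip": "02138", "dob": "1950-07-31", "sex": "F"},
    {"name": "Bob Sample", "zip": "02139", "dob": "1962-01-15", "sex": "M"},
    {"name": "Carl Sample", "zip": "02139", "dob": "1962-01-15", "sex": "M"},
]

# Hypothetical "anonymized" private data set: no names, sensitive field kept.
private = [
    {"zip": "02138", "dob": "1950-07-31", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "dob": "1962-01-15", "sex": "M", "diagnosis": "asthma"},
    {"zip": "90210", "dob": "1980-03-02", "sex": "F", "diagnosis": "diabetes"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")

def link(public_rows, private_rows, keys=QUASI_IDENTIFIERS):
    """Pair each private record with a public name when the key match is unique."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for row in private_rows:
        candidates = index.get(tuple(row[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous or missing matches are skipped
            matches.append((candidates[0]["name"], row["diagnosis"]))
    return matches

print(link(public, private))
```

Only the record whose combination of attributes is unique in the public set gets a name attached; the two people sharing a zip code and birth date stay ambiguous.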
So, all right.
Flying through these slides.
Vectors.
This is where it gets really, this is where I get into the math.
So either go to sleep or those who, anyone math-porn?
Okay.
Your data points now become a vector.
Your record.
Your data.
Attributes.
Yeah.
Boom.
Okay.
We can now apply vector math.
Take it one step further.
The whole database.
It's a matrix.
Boom.
Records, people, attributes, database.
Okay.
Cool.
And, again, we now can apply matrix math to this.
Matrix inversions.
Dot products.
Actually measuring the angular difference between two vectors or matrices can actually find the similarities in large data sets.
Yeah.
Boring, boring, boring, boring.
Math, math, math.
The one cool thing that we did do is, hold on.
So, well, this is the actual mathematical formula for the similarity function in case any of you want to try this at home or see me after class and we'll discuss it.
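The slide's formula is not captured in the transcript. A common similarity function built on the angle between two vectors is cosine similarity, so the Python sketch below assumes that is what was meant; the attribute vectors are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical records encoded as attribute vectors (1 = trait present).
anonymous_record = [1, 0, 1, 1, 0]
candidate_same = [1, 0, 1, 1, 0]
candidate_other = [0, 1, 0, 0, 1]

print(cosine_similarity(anonymous_record, candidate_same))   # close to 1: same direction
print(cosine_similarity(anonymous_record, candidate_other))  # 0: nothing in common
```

Scores near 1 flag records that point the same way in attribute space, which makes them candidates for being the same person.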
Yeah.
Venn diagrams.
This is really cool.
So, we had to understand and represent and identify overlapping data sets.
We had two data sets.
A, B.
Multiple variables that were in common that were the same descriptive traits.
Looked at the intersections of them.
Noted here.
By these little lines across.
Okay.
So, these data sets, independent descriptive variables, and they're in common.
Then we take those little sections that are in common.
Okay.
And we Venn the Venns.
As we say.
So, take those and watch this.
Bam, bam, bam.
Bam!
Right there.
Look at that.
And then, based on that, the subspace defined by that area is the intersection of all of these groups, and it identifies records for which all attributes are identical.
And actually identifies an actual person.
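Intersecting the intersections can be sketched with Python sets. Each set below holds the hypothetical record IDs that agree across the two data sets on one attribute; the IDs and attribute names are made up for illustration.

```python
# Hypothetical record IDs that match between data set A and data set B
# on each shared attribute.
matches_by_attribute = {
    "zip": {101, 102, 103, 104},
    "dob": {102, 104, 107},
    "sex": {102, 103, 104, 108},
    "employer": {104, 109},
}

# Records where every shared attribute agrees: the middle of all the Venns.
all_match = set.intersection(*matches_by_attribute.values())

# Adding attributes one at a time shows the candidate pool shrinking.
candidates = None
for attribute, ids in matches_by_attribute.items():
    candidates = ids if candidates is None else candidates & ids
    print(f"after {attribute}: {sorted(candidates)}")

print(f"identified: {sorted(all_match)}")
```

Each extra attribute can only shrink the pool, and with enough of them a single record is left.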
Wait.
We got.
Okay.
So.
In summation.
The rising side of dark.
Dark side of OSINT.
So, yeah.
Emergence of big data.
Big problem.
Big data.
Lots of tools are being used for analysis and visualization.
More data sets are being developed.
And these mathematical attacks are going to become easier and easier.
It's another weapon for social engineering toolkits.
Because this is information about individuals that we're going to be able to ascertain.
And they're not going to be aware of it.
And they're not voluntarily giving this information.
But it's going to be re-identified about them from these anonymous data sets.
And so, cool for us.
Bad for them.
What can we do to defend against the dark arts?
Proper sanitization methods.
There are not.
There's no way to.
There are no standards for implementing anonymization metrics that meet the utility requirements
but also provide true anonymity.
They don't exist.
So we need access controls.
Or my recommendation is to falsify everything and just make shit up.
So that's what I would do.
In conclusion.
Questions and answers will be handled at the bar.
You guys are buying.
Ladies and gentlemen.
The full presentation will be seen at Skydog Con later this year.
Absolutely.
Can I have a round of applause for the speaker goons for letting us go a little long.
Thank you.
How can we take out Skydog and his buddies?
I'm sorry.
It's okay.
Totally cool.
